final_project.Rmd

  • HW: Final Project
  • Fall 2021, DSPA (HS650)
  • Name: Kevin Wu
  • SID: ####0012 (last 4 digits only)
  • UMich E-mail:
  • I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
  • Remember that students are allowed and encouraged to discuss the problems with their classmates on a conceptual level; however, this cannot involve the exchange of actual code, printouts, solutions, e-mails, or other explicit electronic or paper handouts.

1 Final Project Introduction

The PSID (Panel Study of Income Dynamics) began in 1968, studying over 18,000 individuals living in 5,000 families. In 1997, the PSID launched the CDS (Child Development Supplement) I-III, studying the children of these families. The TAS (Transition into Adulthood Supplement) began in 2005, collecting data from these same children as they transitioned into young adulthood. The following data is a custom dataset composed of the CDSII (children interviewed in 2001) that follows these children into the TAS 2011.

The outcome I want to predict (i.e., the dependent variable) is “change”: 2 (desirable move/state) vs. 0 (steady but not ideal move/state) and 1 (undesirable move/state).

The questions I’m looking to answer are:

  • What model makes the best prediction for teens who flourish?
  • What are the best predictors?

These are the approaches I have considered for analyzing the data:

  1. Use feature-selection models (Boruta and stepwise feature selection), given the nature of my study (link).
  2. Use clustering and nearest-neighbor methods (kNN, k-Means) with one-hot encoding, because clustering can reveal trends within the clusters that may be predictive in nature (link).
  3. I chose not to use apriori techniques because they expect the itemsets to contain repetition (link).
  4. I chose not to use classification trees or other machine-learning algorithms that are opaque and do not reveal the nature of the model, again because of the nature of my study.
  5. I chose to minimize the use of linear modeling / regression because the data are nominal and one-hot encoding yields only binary data, which makes such modeling trivial and uninformative.
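As a minimal sketch of the one-hot encoding step mentioned in approach 2 (`df_toy` and `region` are hypothetical names, not columns of the real dataset):

```r
# model.matrix() expands a nominal factor into 0/1 indicator columns;
# "- 1" removes the intercept so every level gets its own column
df_toy <- data.frame(region = factor(c("north", "south", "north", "west")))
one_hot <- model.matrix(~ region - 1, data = df_toy)
one_hot  # columns: regionnorth, regionsouth, regionwest
```

The resulting binary matrix is what distance-based methods like kNN and k-Means operate on.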

2 Dataset Details

The dataset I am analyzing is a filtered version of a custom dataset that Dr. Ashley Palmer compiled from a 2006 study by Dr. Corey Keyes using the CDSII data.

Some features were removed from Dr. Palmer’s custom dataset because they were derived from the original CDSII and TAS2011 questions used to create the “change” features, i.e., the very features I am studying.

To convert “mhstatus” and “change” into factors, I had to either remove the missing values or account for them. Since there is no way to determine whether a missing value indicates a languishing status (perhaps due to the incomplete nature of the survey) or simply an incomplete response, I chose to remove them, as my main focus is which features aid in positive changes. Removing these 835 participant surveys (30% of the data) leaves the data heavily skewed toward those experiencing positive changes.
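A minimal sketch of this step, with made-up values (the real coding of “change” and its missing category may differ):

```r
# Hypothetical "change" vector: 2 = desirable, 1 = undesirable, 0 = steady,
# NA = missing/incomplete survey
change_raw <- c(2, 0, NA, 1, 2, NA, 2)
# Drop the missing responses, then factor the remainder
change <- factor(change_raw[!is.na(change_raw)], levels = c(0, 1, 2))
prop.table(table(change))  # shows how removal skews the class proportions
```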

Removing the missing data also removed the skew toward stable-but-unideal states and states of decline. For this reason I did not need any minority-class oversampling (e.g., via SMOTE) or downsampling of the majority class.

Some features still contain NA values, and I will determine whether imputing them is appropriate. Simply removing every observation that has an NA in any feature would leave no data to analyze, so the question is whether imputation or substituting the NAs with a dummy value is appropriate.
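To weigh the two options, here is a toy comparison of median imputation versus a dummy/sentinel value (the numbers are made up):

```r
x <- c(3, NA, 5, NA, 4)
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)  # fills NAs with 4
x_dummy   <- ifelse(is.na(x), -9, x)                       # fills NAs with a sentinel
mean(x_imputed)  # stays near the observed center: 4
mean(x_dummy)    # pulled toward the sentinel: -1.2
```

The sentinel shifts every summary statistic, which is the distortion the mean/median comparison in Section 3.2 is checking for.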

I also created a separate dataset containing only objective data: through my modeling I noticed that many of the subjective indicators were strongly influencing the outcomes. I wanted to see what was contributing to those subjective indicators, so I built a dataset of only “objective” features.

Many features are nominal, making it inappropriate to scale them. With roughly 300 features that are similar in nature, feature selection is needed to identify the ones that actually drive the outcome.
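A minimal sketch of scaling only the numeric columns while leaving the nominal factors untouched (`df_toy`, `age`, and `race` are hypothetical names):

```r
# Toy data: one numeric and one nominal feature
df_toy <- data.frame(age = c(17, 19, 21), race = factor(c("A", "B", "A")))
num_cols <- sapply(df_toy, is.numeric)       # TRUE only for numeric columns
df_toy[num_cols] <- scale(df_toy[num_cols])  # z-score numerics; factors untouched
```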

Header File: Defining Datasets

# Load the Stata data files and convert them into R data frames in the environment
library(haven)    # read_dta(); supplies the format.stata / label attributes used below
library(stringr)  # str_flatten(), str_replace_all()

dim(df_full)
## [1] 1966  267
dim(df_objective)
## [1] 1966  207
convertStatDataType <- function(strType) {
  # Map Stata display formats to storage types. byte and int both display as
  # "%8.0g", so they cannot be told apart here (duplicate names in a named
  # vector only ever match the first entry). Fixed-width strings ("%#s",
  # e.g. "%12s") would need a pattern match rather than a literal lookup.
  mapping <- c("%8.0g"="byte/int", "%12.0g"="long", "%9.0g"="float",
               "%10.0g"="double", "%9s"="strL")
  unname(mapping[strType])
}

convertDataframeDisplay <- function(df_convert) {
  # Build a metadata table (column name, label, Stata format, value labels)
  # from the attributes haven attaches to each imported column.
  dfDisplay <- data.frame(matrix(ncol = 0, nrow = ncol(df_convert)))
  vectName <- vector(length = ncol(df_convert))
  vectLabel <- vector(length = ncol(df_convert))
  vectFormat <- vector(length = ncol(df_convert))
  vectDescriptions <- vector(length = ncol(df_convert))
  
  for(i in 1:ncol(df_convert)) {
    attrs <- attributes(df_convert[[i]])
    # names(df_convert[, i]) returns NULL for a plain vector; index the names instead
    vectName[i] <- names(df_convert)[i]
    if(!is.null(attrs$label))
      vectLabel[i] <- attrs$label
    # call the converter directly; assigning an lapply() result into an
    # atomic vector silently coerces the vector to a list
    if(!is.null(attrs$format.stata))
      vectFormat[i] <- convertStatDataType(attrs$format.stata)
    if(is.factor(df_convert[[i]])) {
      vectFormat[i] <- "factor"
      vectDescriptions[i] <- str_flatten(attrs$levels, "<br />")
    } else if(!is.null(attrs$class) && length(attrs$class) >= 3 && attrs$class[3] == "double") {
      # haven_labelled doubles carry class c("haven_labelled", "vctrs_vctr", "double")
      vectFormat[i] <- "double"
    }
    if(!is.null(attrs$labels) && length(attrs$labels) > 0) {
      # was list(n), i.e. a one-element list containing the number n
      vectLabels <- character(length(attrs$labels))
      for(j in seq_along(attrs$labels)) {
        vectLabels[j] <- as.character(attrs$labels[j])
        lblName <- names(attrs$labels[j])
        if(!is.null(lblName) && lblName != "Actual number")
          vectLabels[j] <- paste(vectLabels[j], ":", lblName)
        else if(!is.null(lblName) && lblName == "Actual number")
          vectLabels[j] <- paste0(vectLabels[j], ":::")
      }
      vectDescriptions[i] <- str_flatten(vectLabels, "<br />")
      vectDescriptions[i] <- str_replace_all(vectDescriptions[i], ":::<br />", ", ")
    }
  }
  dfDisplay$ColumnName <- vectName
  dfDisplay$Label <- vectLabel
  dfDisplay$Format <- vectFormat
  dfDisplay$Labels <- vectDescriptions
  return(dfDisplay)
}

2.1 Filtered Dataframe aka Full Dataset

df_display <- convertDataframeDisplay(df_full_display)
datatable(df_display, options = list(
  pageLength=10,
  lengthMenu=c(10,50,100,150,250,300)
  ),
  escape=F
)

2.2 Filtered Dataframe with Only Objective Data aka Objective Dataset

df_objective_display <- convertDataframeDisplay(df_objective_display)
datatable(df_objective_display, options = list(
  pageLength=10,
  lengthMenu=c(10,50,100,150,250,300)
  ),
  escape=F
)

3 Data Distribution

plot_ly(x = ~mhstatus, type="histogram") %>%
  layout(title = "Distribution of Mental Health Status", xaxis = list(title = "Mental Health Status"), bargap=0.1,
         legend = list(orientation = 'h', title = list(text = "<b>Mental Health</b>")))

3.1 Comparison Between Training and Testing Data

prop.table(table(mhstatus_train))
## mhstatus_train
## Languishing to Languishing    Languishing to Moderate 
##                0.002544529                0.020356234 
## Languishing to Flourishing    Moderate to Languishing 
##                0.015903308                0.016539440 
##       Moderate to Moderate    Moderate to Flourishing 
##                0.364503817                0.330788804 
## Flourishing to Languishing Flourishing to Flourishing 
##                0.002544529                0.246819338 
##                    Missing 
##                0.000000000
prop.table(table(mhstatus_test))
## mhstatus_test
## Languishing to Languishing    Languishing to Moderate 
##                0.005076142                0.025380711 
## Languishing to Flourishing    Moderate to Languishing 
##                0.017766497                0.020304569 
##       Moderate to Moderate    Moderate to Flourishing 
##                0.390862944                0.329949239 
## Flourishing to Languishing Flourishing to Flourishing 
##                0.000000000                0.210659898 
##                    Missing 
##                0.000000000
prop.table(table(df_full_train_scaled$PositiveChange))
## 
## -1.06601041843707 0.937599985623426 
##         0.4923858         0.5076142
prop.table(table(df_full_test_scaled$PositiveChange))
## 
## -1.06601041843707 0.937599985623426 
##         0.4923858         0.5076142
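The near-identical class proportions above suggest a stratified split. One common way to produce such a split (a sketch assuming the caret package; the actual split code is not shown in this document):

```r
library(caret)
set.seed(1234)
# createDataPartition() samples within each level of the factor, so the
# train/test class proportions stay close to each other
idx <- createDataPartition(mhstatus, p = 0.8, list = FALSE)
mhstatus_train <- mhstatus[idx]
mhstatus_test  <- mhstatus[-idx]
```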

3.2 To Impute or Not to Impute.

We will compare the means and medians of the imputed data against the data that have not been imputed. While the patterns appear similar, the y-axes for the means and for the medians are vastly different. To preserve the integrity of the data, I will not use imputation. I also tried a violin plot; however, there appears to be little variation among the data.

library(vioplot)
library(tidyr)  # gather() for reshaping wide data to long format

plot_violin <- function(df, df_impute) {
  # Exclude these two fields from the comparison plot
  filter <- c("yaearnings_10", "Q23L30B")
  df <- df[ , !names(df) %in% filter]
  df_impute <- df_impute[ , !names(df_impute) %in% filter]

  # Reshape the unimputed data to long format for plotting
  df_unimputed_long <- gather(df, feature, measurement)
  feature_label_unimputed <- unlist(lapply(df_unimputed_long$feature, fieldname_to_description))
  df_unimputed_plotly <- data.frame(Feature = df_unimputed_long$feature,
                                    FeatureLabel = feature_label_unimputed,
                                    Measurement = df_unimputed_long$measurement)

  # Reshape the imputed data (was gather(df, ...), which re-plotted the unimputed data)
  df_imputed_long <- gather(df_impute, feature, measurement)
  feature_label_imputed <- unlist(lapply(df_imputed_long$feature, fieldname_to_description))
  df_imputed_plotly <- data.frame(Feature = df_imputed_long$feature,
                                  FeatureLabel = feature_label_imputed,
                                  Measurement = df_imputed_long$measurement)

  # Split violins: unimputed on the negative side, imputed on the positive side
  # (side = "postive" was a typo, so both traces landed on the same side)
  p <- plot_ly(type = "violin")
  p <- p %>% add_trace(data = df_unimputed_plotly, x = ~FeatureLabel, y = ~Measurement,
                       legendgroup = "unimputed", scalegroup = "unimputed", name = "unimputed",
                       side = "negative", box = list(visible = T), meanline = list(visible = T),
                       color = I("red"))
  p <- p %>% add_trace(data = df_imputed_plotly, x = ~FeatureLabel, y = ~Measurement,
                       legendgroup = "imputed", scalegroup = "imputed", name = "imputed",
                       side = "positive", box = list(visible = T), meanline = list(visible = T),
                       color = I("green"))
  p <- p %>% layout(
    xaxis = list(title = "Fieldnames"), yaxis = list(title = "Distribution", zeroline = F)
  )
  return(p)
}


plot_summary <- function(df, string_type) {
  # summary() returns a character matrix; row 3 is "Median :x", row 4 is "Mean   :x"
  sum_mat <- summary(df)
  vec_medians <- sum_mat[3, ]
  vec_means <- sum_mat[4, ]
  vec_fieldnames <- trimws(colnames(sum_mat))
  # str_replace() is vectorized, so no loop is needed; converting element by
  # element inside a character vector would silently re-coerce to character
  vec_medians <- as.numeric(str_replace(vec_medians, "Median :", ""))
  vec_means <- as.numeric(str_replace(vec_means, "Mean   :", ""))
  df_new <- data.frame(
    medians = vec_medians,
    means = vec_means,
    fieldnames = vec_fieldnames
  )
  p <- plot_ly()
  p <- p %>% add_trace(data = df_new, x = ~fieldnames, y = ~means,  type = "bar", name="Means")
  # xaxis was nested inside the yaxis list, so its title was silently dropped
  p <- p %>% layout(title=paste0("Distribution of Means of Summary for ", string_type),
                    yaxis = list(title = "Count"), xaxis = list(title = "Field Names"))
  q <- plot_ly()
  q <- q %>% add_trace(data = df_new, x = ~fieldnames, y = ~medians, type = "bar", name="Medians")
  q <- q %>% layout(title=paste0("Distribution of Medians of Summary for ", string_type),
                    yaxis = list(title = "Count"), xaxis = list(title = "Field Names"))

  return(list(p, q))
}


set.seed(1234)
df_full_impute_forest <- as.data.frame(missForest::missForest(as.matrix(df_full_na), maxiter=1)$ximp)
##   missForest iteration 1 in progress...done!
plot_violin(df_full_na, df_full_impute_forest)
plotted_summary_imputed <- plot_summary(df_full_impute_forest, "Full Dataset - Random Forest")
plotted_summary_unimputed <- plot_summary(df_full_na, "Full Dataset - Non Imputed")

# Means
plotted_summary_imputed[[1]]
plotted_summary_unimputed[[1]]
# Medians
plotted_summary_imputed[[2]]
plotted_summary_unimputed[[2]]

4 Summary